## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
This report explores a data set of red wines with 11 variables describing chemical properties and one variable for the expert quality rating, which ranges from 0 (very bad) to 10 (excellent). Using this data, we will try to find which chemical properties influence the quality of red wines.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The data contains 12 variables and 1599 observations. All the variables are numeric.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The summary statistics for each variable are shown above.
##
## Bad good very_good
## 63 1319 217
The quality ratings range from 3 to 8, with most ratings at the middle values (5 or 6). I created a new column named quality.level, which divides the ratings into five groups (only three of them occur in this data set).
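As a minimal sketch, the grouping could be done in base R with `cut()`, using the bins listed later in this report (1-2, 3-4, 5-6, 7-8, 9-10). The helper name `quality_level`, the label spellings for the two empty groups, and the treatment of a rating of 0 are assumptions; this is not the original code.

```r
# Sketch only: quality_level() is a hypothetical helper, not the report's code.
quality_level <- function(quality) {
  groups <- cut(
    quality,
    breaks = c(0, 2, 4, 6, 8, 10),  # intervals (0,2], (2,4], (4,6], (6,8], (8,10]
    labels = c("very_bad", "Bad", "good", "very_good", "excellent"),
    include.lowest = TRUE           # assumption: a rating of 0 also counts as very_bad
  )
  as.character(groups)
}

# Ratings of the first six wines printed at the top of this report
head_quality <- c(5, 5, 5, 6, 5, 5)
table(quality_level(head_quality))
```

Applied to the full quality column, a helper like this should reproduce the three non-empty groups shown in the table above.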
Fixed acidity is right skewed, with the peak around 7.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
In the second plot I limited the volatile acidity values to a maximum of 1 to get a closer look at the data. Most of the data lies between 0.3 and 0.6.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 152 9.2 0.52 1 3.4 0.61
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 152 32 69 0.9996 2.74 2
## alcohol quality quality.level
## 152 9.4 4 Bad
Many observations have a citric acid value of 0.0, and there is another peak at 0.49. In the second plot I excluded the single observation with a value of 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
##
## dry off_dry sweet very_sweet
## 2 1357 232 8
When we plot residual sugar on a log scale, we can see that many values have few or no observations. The peak of the data is around 2. Above 4, the count per value is less than 10. To get a better view of the sugar level, I created a new column, “sugar.level”, based on the Wine Folly website (https://winefolly.com/review/sugar-in-wine-chart/). I divided the sugar level into:
* dry -> below 1.2
* off-dry -> 1.2 - 3
* sweet -> 3 - 12
* very sweet -> above 12
Based on the new sugar levels, most of the wines are off-dry, followed by sweet.
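The sugar levels can be sketched in base R as follows. The helper name `sugar_level` is hypothetical, and the handling of the boundary values is an assumption: putting exactly 3 into "sweet" is consistent with the counts printed above (1357 values fall from 1.2 up to, but not including, 3), while the data has no value at exactly 12.

```r
# Sketch only: sugar_level() is a hypothetical helper, not the report's code.
sugar_level <- function(residual_sugar) {
  ifelse(residual_sugar < 1.2, "dry",
  ifelse(residual_sugar < 3,   "off_dry",
  ifelse(residual_sugar <= 12, "sweet", "very_sweet")))
}

# Residual sugar of the first six wines printed at the top of this report
head_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8)
table(sugar_level(head_sugar))
```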
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The second plot shows the log10 of the data, which shows the outliers better. Most of the data is between 0.06 and 0.112.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Both free sulfur dioxide and total sulfur dioxide are right skewed. Limiting the total sulfur dioxide data to 175 gives a clearer picture of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density has an approximately normal distribution centered around 0.9965.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## # A tibble: 3 x 2
## `rw$pH.level` n
## <chr> <int>
## 1 high 48
## 2 low 726
## 3 moderate 825
pH values have an approximately normal distribution. To better understand the effect of pH, I created a new column, “pH.level”, based on the Wine Spectator website (https://www.winespectator.com/drvinny/show/id/5035):
* 3.3 to 3.6 -> moderate (best)
* lower than 3.3 -> low
* higher than 3.6 -> high
Most of the pH values are moderate, followed by low.
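A minimal sketch of the pH grouping in base R is shown below. The helper name `ph_level` is hypothetical, the labels follow the table printed above (low, moderate, high), and treating 3.3 and 3.6 themselves as moderate is an assumption.

```r
# Sketch only: ph_level() is a hypothetical helper, not the report's code.
ph_level <- function(ph) {
  ifelse(ph < 3.3,  "low",
  ifelse(ph <= 3.6, "moderate", "high"))
}

# pH of the first six wines printed at the top of this report
head_ph <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
table(ph_level(head_ph))
```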
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The graph is skewed to the right. The majority of data is between 0.5 and 0.7.
The shape of the data is right skewed with the peak around 7.
There are 1599 red wine records with 11 columns for chemical characteristics (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and one column for the expert rating (quality). All the variables are numeric. I then created three more columns (quality.level, sugar.level, and pH.level).
Other observations:
* Most of the quality ratings are at the middle values (5 or 6), with a minimum of 3 and a maximum of 8.
* Many observations have a citric acid value of 0.0.
* Most sugar values are around 2.
* The density and pH values are approximately normally distributed.
* Fixed acidity, sulphates, free sulfur dioxide, and total sulfur dioxide are right skewed.
The main feature in the data set is the quality. I would like to see how other features affect it.
I think sugar, citric acid, and alcohol will affect the quality.
Yes, I created three new columns:
quality.level: based on the quality column, I divided the quality ratings into:
* 1 - 2 -> very bad (no values)
* 3 - 4 -> Bad
* 5 - 6 -> Good
* 7 - 8 -> very good
* 9 - 10 -> excellent (no values)
sugar.level: based on the Wine Folly website (https://winefolly.com/review/sugar-in-wine-chart/), I divided the sugar values into:
* dry -> below 1.2
* off-dry -> 1.2 - 3
* sweet -> 3 - 12
* very sweet -> above 12
pH.level: based on the Wine Spectator website (https://www.winespectator.com/drvinny/show/id/5035), I divided the pH values into:
* 3.3 to 3.6 -> moderate (best)
* lower than 3.3 -> low
* higher than 3.6 -> high
I log-transformed the residual sugar to get a better view of the plot; it shows that many values have few or no observations. I also log-transformed the chlorides plot, where the transformation better shows the outliers and the gap after the lowest value, 0.012.
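To illustrate why the log scale makes the gap after 0.012 visible, the sketch below compares the spacing between the chlorides summary statistics reported earlier, on the raw scale and on the log10 scale. The variable names are illustrative and not taken from the original code.

```r
# Summary statistics of chlorides as reported in this document
chlorides_summary <- c(min = 0.012, q1 = 0.07, median = 0.079, q3 = 0.09, max = 0.611)

# Distance between consecutive statistics on each scale
raw_gaps <- diff(chlorides_summary)
log_gaps <- diff(log10(chlorides_summary))

round(rbind(raw = raw_gaps, log10 = log_gaps), 3)
```

On the raw scale the step from the minimum to the first quartile is small next to the step from the third quartile to the maximum; on the log10 scale the two steps are of similar size, so the low end of the distribution is no longer squeezed against the axis.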
Scatter plot of fixed acidity per quality group. The red line represents the mean. The blue lines represent the 0.05 and 0.95 quantiles.
Box plot of fixed acidity per quality level.
Scatter plot of volatile acidity per quality level.
Box plot of citric acid per quality level.
In both graphs the blue lines represent the 0.05 and 0.95 quantiles, and the red line represents the mean. The lower the volatile acidity, the higher the quality. For fixed acidity and citric acid, the boxplots show that higher values go with better quality.
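The values behind such lines can be computed per quality group, as in the minimal sketch below. It uses only the six rows printed at the top of this report, so it demonstrates the calculation rather than the full-data result; `line_stats` is a hypothetical helper name.

```r
# First six rows of the data, as printed at the top of this report
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  quality       = c(5, 5, 5, 6, 5, 5)
)

# Mean plus the 0.05 and 0.95 quantiles of one group of values
line_stats <- function(x) {
  c(q05  = unname(quantile(x, 0.05)),
    mean = mean(x),
    q95  = unname(quantile(x, 0.95)))
}

# One column per quality rating, one row per statistic
stats_by_quality <- sapply(split(wines$fixed.acidity, wines$quality), line_stats)
stats_by_quality
```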
The blue lines show the 0.95 and 0.05 quantiles. The red line shows the mean.
Most of the very good wine is off-dry, which means the residual sugar is between 1.2 and 3. Generally speaking, most wines are off-dry or sweet regardless of their quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The mean chlorides value of very good quality wine is lower than the rest.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Boxplot of free sulfur dioxide by quality level
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Boxplot of the log of free sulfur dioxide by quality level
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Boxplot of total sulfur dioxide by quality level
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Boxplot of the log of total sulfur dioxide by quality level
For both total and free sulfur dioxide, a lower value is better.
> For density as well, very good quality wine has the lowest mean.
Bad quality wine has the highest mean pH and very good quality wine has the lowest, around 3.25. Most of the pH values are low or moderate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
> For sulphates, very good quality wine has the highest mean.
Clearly, the highest quality level has the highest mean alcohol.
I observed the relationship between the quality group and each of the other variables.
fixed acidity, volatile acidity, and citric acid: For both fixed acidity and citric acid, the mean increases with better quality. For volatile acidity it is the opposite: the mean decreases with better quality.
residual sugar: Most of the wines are off-dry, followed by sweet. There is no specific characteristic that distinguishes good wine from bad wine.
chlorides: The mean of very good quality wine is lower than the rest.
total sulfur dioxide and free sulfur dioxide: For both the total and free sulfur dioxide, having a lower value is better.
density and pH: For both, the very good quality has the lowest mean.
sulphates: The better the quality, the higher the mean.
alcohol: The better the quality, the higher the mean.
No.
The increase in alcohol, sulphates, and citric acid with quality group, and the decrease in volatile acidity with quality group.
The higher the alcohol level, the lower the density.
The higher the sugar value, the lower the alcohol value. Very good wine has an off-dry to sweet sugar level and a higher alcohol value.
The very sweet wines appear in the good quality group, and their alcohol level is low. The density decreases as the sweetness level decreases.
## [1] "The mean of volatile acidity for low pH level:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.370 0.480 0.494 0.600 1.240
## [1] "The mean of volatile acidity for moderate pH level:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.4200 0.5600 0.5531 0.6600 1.5800
## [1] "The mean of volatile acidity for high pH level:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4675 0.5900 0.6054 0.6625 1.1850
I removed one observation with a citric acid value equal to 1. The number of observations with a high pH level is very small (48), especially among the very good wines. In the very good quality group, the lower the pH level, the lower the mean volatile acidity and the higher the citric acid value.
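Per-level summaries like the ones printed above can be produced with `tapply()`. The sketch below is a minimal illustration using only the six rows shown at the top of this report; the pH cut-offs follow the pH.level definition in this document, and including the boundary values 3.3 and 3.6 in the moderate level is an assumption.

```r
# First six rows of the data, as printed at the top of this report
wines <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# pH levels as defined in this report (boundary handling is an assumption)
wines$pH.level <- ifelse(wines$pH < 3.3, "low",
                  ifelse(wines$pH <= 3.6, "moderate", "high"))

# One summary() of volatile acidity per pH level
va_by_ph <- tapply(wines$volatile.acidity, wines$pH.level, summary)
va_by_ph
```

With only six rows, the "high" level is empty and the means are not representative; the full data is needed to reproduce the figures above.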
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality quality.level sugar.level pH.level
## Min. :3.000 Length:1599 Length:1599 Length:1599
## 1st Qu.:5.000 Class :character Class :character Class :character
## Median :6.000 Mode :character Mode :character Mode :character
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The values for good and very good quality are similar; however, the very good quality has a higher sulphates mean.
pH, volatile acidity, and citric acid:
* To have very good wine we need to balance the citric acid value and the pH value. The lower the pH, the higher the citric acid. This relationship is clear in the very good wine only.
* For very good wine, low pH values have lower volatile acidity.
* The lower the pH level, the lower the mean volatile acidity.
Free sulfur dioxide, sulphates, and quality group:
* The very good quality has the highest sulphates mean.
* There is no relationship between the free sulfur dioxide and the sulphates.
Density, alcohol, and sugar level:
* The density decreases as the sweetness level decreases.
* The alcohol level is low for the very sweet wine.
* Having a better wine means having a higher alcohol level and an off-dry to sweet level of sugar.
To have very good wine we need to balance the citric acid value and the pH value.
The chlorides level is responsible for the saltiness of the wine. The plot shows that very good wine has a lower mean.
Volatile acidity gives a vinegar taste to the wine; for better wine this value decreases. Citric acid gives some freshness and flavor to the wine; for better wine this value increases. For pH, the closer the value is to 0, the more acidic the wine; very good wine has the lowest mean. To have very good wine we need to balance the citric acid value and the pH value: the lower the pH, the higher the citric acid. This relationship is clear in the very good wine only.
This data contains 1599 red wine records with 11 columns for chemical characteristics and 1 column for the expert rating. I created three more columns: quality level, to classify the quality rating further; sugar level, which divides the sugar values into 4 levels based on the Wine Folly website; and pH level, which divides the pH values into 3 levels based on the Wine Spectator website.
Before starting the analysis, here is a short explanation of the chemical characteristics:
* Fixed acidity: most acids involved with wine are fixed or nonvolatile
* Volatile acidity: too much gives the taste of vinegar
* Citric acid: gives the wine a taste of freshness
* Residual sugar: the sugar value (divided into levels in the sugar level column)
* Chlorides: salt
* Free sulfur dioxide: prevents microbial growth and the oxidation of the wine
* Total sulfur dioxide: free + bound sulfur dioxide
* Density: the density of the wine, which depends on the alcohol and sugar level
* pH: closer to 0 is more acidic and closer to 14 is more basic
* Sulphates: antimicrobial and antioxidant
* Alcohol: alcohol level
I plotted the variables to better understand their distributions. For the quality, most of the wines fall under good quality (a rating of 5 or 6). For the sugar level, most of the values fall under the off-dry level. The density and pH values have an approximately normal distribution. Alcohol, sulphates, total sulfur dioxide, free sulfur dioxide, and fixed acidity are right skewed.
After that I examined each variable against the quality variable to find out whether there is a relationship between them. I found that volatile acidity, chlorides, total and free sulfur dioxide, density, and pH decrease with better quality, while fixed acidity, citric acid, sulphates, and alcohol increase with better quality.
To identify whether a combination of variables can affect the quality, I used multivariate analysis and found that for very good wine there is a balance between the citric acid and pH values: the lower the pH, the higher the citric acid. This relationship is clear in the very good wine only. There is another relationship between the sugar level and the alcohol value: the higher the sugar level, the lower the alcohol value. Having a better wine means having a higher alcohol level and an off-dry to sweet level of sugar.